So that's why! I would love to see a release of chardet that fixes this bug. As it is, chardet basically can't be used for Latin-1, and that's the most common single-byte encoding. (Well, Windows-1252 is, but chardet doesn't really distinguish those.)
Hi David,
Following up on our exchange from a few weeks ago. I've committed the fix for the integer math bug (which caused a file with even one "low confidence" character to get a confidence level of 0) and also updated the confidence multiplier for Latin-1 to one that works better (the original multiplier caused many files to be incorrectly detected as ISO Latin-2). Testing on both the problem case mentioned ( http://www.lvo.com/GASTRONOMIE/VINS/VITI/VITI1F.HTML ) and the character set tables from http://www.columbia.edu/kermit/csettables.html suggests that the fixed code performs better.
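To illustrate the kind of integer math bug described above, here is a minimal sketch. It is not the actual chardet code: the function names, the three counters, and the penalty weight of 20 are all assumptions chosen for illustration. The first function uses floor division (`//`) to mimic how `/` behaves on two integers in Python 2; the second does the same calculation in floating point.

```python
# Hypothetical illustration only; names and the penalty weight are assumptions,
# not taken from chardet's source.
PENALTY = 20  # assumed weight applied to each "low confidence" character


def confidence_truncating(likely: int, unlikely: int, total: int) -> float:
    """Confidence computed with integer (floor) division, as Python 2's `/` did."""
    if total == 0:
        return 0.0
    # likely // total is 0 whenever likely < total, i.e. as soon as
    # a single character falls outside the "likely" category.
    score = likely // total - (unlikely * PENALTY) // total
    return max(float(score), 0.0)


def confidence_float(likely: int, unlikely: int, total: int) -> float:
    """Same calculation using true (floating-point) division."""
    if total == 0:
        return 0.0
    score = (likely - unlikely * PENALTY) / total
    return max(score, 0.0)


if __name__ == "__main__":
    # 1000 characters, one of which is "low confidence".
    print(confidence_truncating(999, 1, 1000))
    print(confidence_float(999, 1, 1000))
```

With 999 likely characters out of 1000, the truncating version collapses to zero, while the floating-point version stays close to 1. A separate per-encoding multiplier, like the one mentioned for Latin-1, would then scale this value.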
Hope this is useful - it's certainly helped fix a few problem documents in the application I'm working on.
Best,
Doug